Fast Quantum Modular Exponentiation 
We present a detailed analysis of the impact on quantum modular exponentiation of architectural features and
possible concurrent gate execution. Various arithmetic algorithms are evaluated for execution time, potential
concurrency, and space tradeoffs. We find that to exponentiate an n-bit number, for storage space 100n (twenty
times the minimum 5n), we can execute modular exponentiation two hundred to seven hundred times faster than
optimized versions of the basic algorithms, depending on architecture, for n = 128. Addition on a neighbor-only
architecture is limited to O(n) time, while non-neighbor architectures can reach O(log n), demonstrating
that physical characteristics of a computing device have an important impact on both real-world running time
and asymptotic behavior. Our results will help guide experimental implementations of quantum algorithms and
devices.

PACS numbers: 03.67.Lx, 07.05.Bx, 89.20.Ff 



I. INTRODUCTION 

Research in quantum computing is motivated by the possibility of
enormous gains in computational time [1, 2, 3, 4]. The process of
writing programs for quantum computers naturally depends on the
architecture, but the application of classical computer architecture
principles to the architecture of quantum computers has only just begun.

Shor's algorithm for factoring large numbers in polynomial time is
perhaps the most famous result to date in the field [1]. Since this
algorithm is well defined and important, we will use it as an example
to examine the relationship between architecture and program
efficiency, especially parallel execution of quantum algorithms.
Shor's factoring algorithm consists of two main parts, quantum modular
exponentiation followed by the quantum Fourier transform. In this paper
we will concentrate on the quantum modular exponentiation, both because
it is the most computationally intensive part of the algorithm and
because arithmetic circuits are fundamental building blocks that we
expect to be useful for many algorithms.

Fundamentally, quantum modular exponentiation is $O(n^3)$; that is, the
number of quantum gates or operations scales with the cube of the
length in bits of the number to be factored [5, 6, 7]. It consists of
2n modular multiplications, each of which consists of O(n) additions,
each of which requires O(n) operations. However, $O(n^3)$ operations do
not necessarily require $O(n^3)$ time steps. On an abstract machine, it
is relatively straightforward to see how to reduce each of those three
layers to O(log n) time steps, in exchange for more space and more
total gates, giving a total running time of $O(\log^3 n)$ if $O(n^3)$
qubits are available and an arbitrary number of gates can be executed
concurrently on separate qubits. Such large numbers of qubits are not
expected to be practical for the foreseeable future, so much
interesting engineering lies in optimizing for a given set of
constraints. This paper quantitatively explores those tradeoffs.
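
As a rough illustration of the gap between gate count and depth described above, the following sketch multiplies out the three layers for a given n. It is an order-of-magnitude estimate only: all constant factors are dropped, and the function name `layer_counts` and the choice to take each layer's depth as a ceiling of log2 are assumptions made for this example, not figures from the analysis.

```python
import math

def layer_counts(n: int) -> dict:
    """Order-of-magnitude counts for n-bit modular exponentiation (constants dropped)."""
    multiplications = 2 * n      # modular multiplications in the exponentiation
    additions_per_mult = n       # O(n) additions per multiplication
    gates_per_add = n            # O(n) gates per carry-ripple addition
    total_gates = multiplications * additions_per_mult * gates_per_add
    # With unlimited qubits and concurrency, each layer collapses to a log-depth tree.
    log_depth = math.ceil(math.log2(multiplications)) * math.ceil(math.log2(n)) ** 2
    return {"total_gates": total_gates, "log_depth": log_depth}

counts = layer_counts(128)
print(counts)
```

The two numbers bracket the design space: a fully serial machine sits near the first, and a machine with cubic space and unlimited concurrency approaches the second.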
This paper is intended to help guide the design and experimental
implementation of actual quantum computing devices as the number of
qubits grows over the next several generations of devices. Depending on
the application-level effective clock rate after quantum error
correction for a specific technology, the choice of exponentiation
algorithm may be the difference between hours of computation time and
weeks, or between seconds and hours. This difference, in turn, feeds
back into the system requirements for the necessary strength of error
correction and coherence time.

The Schönhage-Strassen multiplication algorithm is often quoted in
quantum computing research as being O(n log n log log n) for a single
multiplication [8]. However, simply citing Schönhage-Strassen without
further qualification is misleading for several reasons. Most
importantly, the constant factors matter: quantum modular
exponentiation based on Schönhage-Strassen is only faster than basic
$O(n^3)$ algorithms for numbers longer than approximately 32 kilobits.
In this paper, we will concentrate on smaller problem sizes, and exact,
rather than O(·), performance.

Concurrent quantum computation is the execution of more than one
quantum gate on independent qubits at the same time. Utilizing
concurrency, the latency, or circuit depth, to execute a number of
gates can be smaller than the number itself. Circuit depth is
explicitly considered in Cleve and Watrous' parallel implementation of
the quantum Fourier transform [9], Gossett's quantum carry-save
arithmetic [10], and Zalka's Schönhage-Strassen-based implementation
[11]. Moore and Nilsson define the computational complexity class QNC
to describe certain parallelizable circuits, and show which gates can
be performed concurrently, proving that any circuit composed
exclusively of Control-NOTs (CNOTs) can be parallelized to be of depth
O(log n) using $O(n^2)$ ancillae on an abstract machine [12].
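
The distinction between gate count and circuit depth can be made concrete with a small scheduler. The sketch below is illustrative only: it assumes gates are given in program order as tuples of qubit indices, that every gate takes one time slot, and that a qubit can take part in only one gate per slot. The function name `circuit_depth` and the example circuit are hypothetical.

```python
from typing import Iterable, Sequence

def circuit_depth(gates: Iterable[Sequence[int]]) -> int:
    """Schedule each gate as early as possible; return the number of time slots used."""
    free_at: dict[int, int] = {}   # qubit -> first time slot in which it is idle
    depth = 0
    for qubits in gates:
        start = max((free_at.get(q, 0) for q in qubits), default=0)
        for q in qubits:
            free_at[q] = start + 1
        depth = max(depth, start + 1)
    return depth

# Hypothetical 4-qubit circuit: two CNOTs on disjoint qubits, then one joining them.
gates = [(0, 1), (2, 3), (1, 2)]
print(len(gates), circuit_depth(gates))
```

Here three gates execute in two time slots because the first two act on independent qubits.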

We analyze two separate architectures, still abstract but with some
important features that help us understand performance. For both
architectures, we assume any qubit can be the control or target for
only one gate at a time. The first, the AC, or Abstract Concurrent,
architecture, is our abstract model. It supports CCNOT (the three-qubit
Toffoli gate, or Control-Control-NOT), arbitrary concurrency, and gate
operands any distance apart without penalty. It does not support
arbitrary control strings on control operations, only CCNOT with two
ones as control. The second, the NTC, or Neighbor-only,
Two-qubit-gate, Concurrent architecture, is similar but does not
support CCNOT, only two-qubit gates, and assumes the qubits are laid
out in a one-dimensional line, and only neighboring qubits can
interact. The 1D layout will have the highest communications costs
among possible physical topologies. Most real, scalable architectures
will have constraints with this flavor, if different details, so AC
and NTC can be viewed as bounds within which many real architectures
will fall. The layout of variables on this structure has a large
impact on performance; what is presented here is the best we have
discovered to date, but we do not claim it is optimal.

The NTC model is a reasonable description of several important
experimental approaches, including a one-dimensional chain of quantum
dots [13], the original Kane proposal [14], and the all-silicon NMR
device [15]. Superconducting qubits [16, 17] may map to NTC, depending
on the details of the qubit interconnection.

The difference between AC and NTC is critical; beyond the important
constant factors as nearby qubits shuffle, we will see in section III B
that AC can achieve O(log n) performance where NTC is limited to O(n).

For NTC, which does not support CCNOT directly, we compose CCNOT from a
set of five two-qubit gates [18], as shown in figure 1. The box with
the bar on the right represents the square root of X,

$$ \sqrt{X} = \frac{1}{2}\begin{pmatrix} 1+i & 1-i \\ 1-i & 1+i \end{pmatrix}, $$

and the box with the bar on the left its adjoint. We assume that this
gate requires the same execution time as a CNOT.

FIG. 1: CCNOT constructions for our architectures AC and NTC. The box
with the bar on the right represents the square root of X, and the box
with the bar on the left its adjoint. Time flows left to right, each
horizontal line represents a qubit, and each vertical line segment is a
quantum gate.

Section II reviews Shor's algorithm and the need for modular
exponentiation, then summarizes the techniques we employ to accelerate
modular exponentiation. The next subsection introduces the best-known
existing modular exponentiation algorithms and several different
adders. Section III begins by examining concurrency in the lowest
level elements, the adders. This is followed by faster adders and
additional techniques for accelerating modulo operations and
exponentiation. Section IV shows how to balance these techniques and
apply them to a specific architecture and set of constraints. We
evaluate several complete algorithms for our architectural models.
Specific gate latency counts, rather than asymptotic values, are given
for 128 bits and smaller numbers.

II. BASIC CONCEPTS

A. Modular Exponentiation and Shor's Algorithm



Shor's algorithm for factoring numbers on a quantum computer uses the
quantum Fourier transform to find the order r of a randomly chosen
number x in the multiplicative group (mod N). This is achieved by
exponentiating x, modulo N, for a superposition of all possible
exponents a. Therefore, efficient arithmetic algorithms to calculate
modular exponentiation in the quantum domain are critical.
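
For intuition, the quantity the quantum computer extracts can be computed classically for a toy case. The sketch below is a classical illustration only, with N = 15 and x = 7 chosen as example values; the helper name `multiplicative_order` is hypothetical. It tabulates the function a → x^a mod N, whose period is the order r.

```python
def multiplicative_order(x: int, N: int) -> int:
    """Smallest r > 0 with x**r % N == 1; assumes gcd(x, N) == 1."""
    value, r = x % N, 1
    while value != 1:
        value = (value * x) % N
        r += 1
    return r

N, x = 15, 7
# The function a -> x**a mod N that the quantum circuit evaluates in superposition.
table = [pow(x, a, N) for a in range(8)]
print(table)                       # the sequence repeats with period r
print(multiplicative_order(x, N))
```

The classical loop takes time proportional to r, which is what makes it infeasible for large N and motivates evaluating all exponents in superposition.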

Quantum modular exponentiation is the evolution of the state of a
quantum computer to hold

$$ |\psi\rangle|0\rangle \rightarrow |\psi\rangle|x^{\psi} \bmod N\rangle \qquad (1) $$

When $|\psi\rangle$ is the superposition of all input states a up to a
particular value $2N^2$,

$$ |\psi\rangle = \frac{1}{N\sqrt{2}} \sum_{a=0}^{2N^2-1} |a\rangle \qquad (2) $$

the result is the superposition of the modular exponentiation of those
input states,

$$ \frac{1}{N\sqrt{2}} \sum_{a=0}^{2N^2-1} |a\rangle|x^{a} \bmod N\rangle \qquad (3) $$



Depending on the algorithm chosen for modular exponentiation, x may
appear explicitly in a register in the quantum computer, or may appear
only implicitly in the choice of instructions to be executed.

In general, quantum modular exponentiation algorithms are created from
building blocks that do modular multiplication,

$$ |\alpha\rangle|0\rangle \rightarrow |\alpha\rangle|\alpha\beta \bmod N\rangle \qquad (4) $$

where $\beta$ and N may or may not appear explicitly in quantum
registers. This modular multiplication is built from blocks that
perform modular addition,

$$ |\alpha\rangle|0\rangle \rightarrow |\alpha\rangle|\alpha + \beta \bmod N\rangle \qquad (5) $$

which, in turn, are usually built from blocks that perform addition and
comparison.

Addition of two n-bit numbers requires O(n) gates. Multiplication of
two n-bit numbers (including modular multiplication) combines the
convolution partial products (the one-bit products) of each pair of
bits from the two arguments. This requires O(n) additions of n-bit
numbers, giving a gate count of $O(n^2)$. Our exponentiation for Shor's
algorithm requires 2n multiplications, giving a total cost of $O(n^3)$.
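
The layering of exponentiation on multiplication on addition can be mimicked classically to count calls. The following is a minimal classical sketch, not a quantum circuit: the function names, the global `calls` counter, and the example values N = 15, x = 7, a = 13 are assumptions for illustration, and the shifted addends and powers of x are computed classically with `%` and `pow`.

```python
calls = {"mod_add": 0, "mod_mul": 0}

def mod_add(a: int, b: int, N: int) -> int:
    """Addition followed by a comparison and conditional subtraction of N."""
    calls["mod_add"] += 1
    s = a + b
    return s - N if s >= N else s

def mod_mul(a: int, b: int, N: int, n: int) -> int:
    """Shift-and-add: one modular addition per bit of a (n additions in total)."""
    calls["mod_mul"] += 1
    total = 0
    for i in range(n):
        partial = ((b << i) % N) if (a >> i) & 1 else 0   # zero when bit i of a is clear
        total = mod_add(total, partial, N)
    return total

def mod_exp(x: int, a: int, N: int, n: int, exp_bits: int) -> int:
    """One multiplication per exponent bit, by x**(2**i) mod N or by 1."""
    result = 1
    for i in range(exp_bits):
        factor = pow(x, 1 << i, N) if (a >> i) & 1 else 1   # classically precomputed
        result = mod_mul(result, factor, N, n)
    return result

N, n, x, a = 15, 4, 7, 13
assert mod_exp(x, a, N, n, 2 * n) == pow(x, a, N)
print(calls)
```

For n = 4 the counter records 2n = 8 multiplications and 8 × 4 = 32 modular additions, matching the 2n × O(n) structure described above; each addition would in turn cost O(n) gates.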

Many of these steps can be conducted in parallel; in classical computer
system design, the latency or circuit depth, the time from the input
of values until the output becomes available, is as important as the
total computational complexity. Concurrency is the execution of more
than one gate during the same execution time slot. We will refer to the
number of gates executing in a time slot as the concurrency or the
concurrency level. Our goal through the rest of the paper is to exploit
parallelism, or concurrency, to shorten the total wall clock time to
execute modular exponentiation, and hence Shor's algorithm.

The algorithms as described here run on logical qubits, which will be
encoded onto physical qubits using quantum error correction (QEC) [19].
Error correction processes are generally assumed to be applied in
parallel across the entire machine. Executing gates on the encoded
qubits, in some cases, requires additional ancillae, so multiple
concurrent logical gates will require growth in physical qubit storage
space [20, 21]. Thus, both physical and logical concurrency are
important; in this paper we consider only logical concurrency.

B. Notation and Techniques for Speeding Up Modular
Exponentiation

In this paper, we will use N as the number to be factored, and n to
represent its length in bits. For convenience, we will assume that n is
a power of two, and the high bit of N is one. x is the random value,
smaller than N, to be exponentiated, and $|a\rangle$ is our
superposition of exponents, with $a < 2N^2$ so that the length of a is
2n + 1 bits.

When discussing circuit cost, the notation is
(CCNOTs; CNOTs; NOTs) or (CNOTs; NOTs).
The values may be total gates or circuit depth (latency), depending on
context. The notation is sometimes enhanced to show required
concurrency and space,
(CCNOTs; CNOTs; NOTs)#(concurrency; space).

t is time, or latency to execute an algorithm, and S is space, 
subscripted with the name of the algorithm or circuit subrou- 
tine. When t or S is superscripted with AC or NTC, the 
values are for the latency of the construct on that architecture. 
Equations without superscripts are for an abstract machine as- 
suming no concurrency, equivalent to a total gate count for 
the AC architecture. R is the number of calls to a subroutine, 
subscripted with the name of the routine. 

m, g, f, p, b, and s are parameters that determine the behavior of
portions of our modular exponentiation algorithm. m, g, and f are part
of our carry-select/conditional-sum adder (sec. III B). p and b are
used in our indirection scheme (sec. III E). s is the number of
multiplier blocks we can fit into a chosen amount of space (sec. III C).

Here we summarize the techniques, which are detailed in the following
subsections. Our fast modular exponentiation circuit is built using the
following optimizations:

• Select the correct qubit layout and subsequences to implement gates,
then hand optimize (no penalty) [22, 23, 24].

• Look for concurrency within addition/multiplication (no space
penalty, maybe noise penalty) (sec. III A).

• Select the multiplicand using table/indirection (exponential
classical cost, linear reduction in quantum gate count) ([29],
sec. III E).

• Do multiplications concurrently (linear speedup for small values,
linear cost in space, small gate count increase; requires a
quantum-quantum (Q-Q) multiplier, as well as a classical-quantum (C-Q)
multiplier) (sec. III C).

• Move to, e.g., carry-save adders ($n^2$ space penalty for reduction
to log time, increases total gate count) ([10], sec. II C 4),
conditional-sum adders (sec. III B 2), or carry-lookahead adders
(sec. II C 5).

• Reduce modulo comparisons, only subtracting N on overflow (small
space penalty, linear reduction in modulo arithmetic cost)
(sec. III D).



C. Existing Algorithms 

In this section we will review various components of the modular
exponentiation which will be used to construct our parallelized version
of the algorithm in section III. There are many ways of building adders
and multipliers, and choosing the correct one is a technology-dependent
exercise. Only a few classical techniques have been explored for
quantum computation. The two most commonly cited modular exponentiation
algorithms are those of Vedral, Barenco, and Ekert, which we will refer
to as VBE, and Beckman, Chari, Devabhaktuni, and Preskill, which we
will refer to as BCDP. Both BCDP and VBE algorithms build multipliers
from variants of carry-ripple adders, the simplest but slowest method;
Cuccaro et al. have recently shown the design of a smaller, faster
carry-ripple adder. Zalka proposed a carry-select adder; we present our
design for such an adder in detail in section III B. Draper et al. have
recently proposed a carry-lookahead adder, and Gossett a carry-save
adder. Beauregard has proposed a circuit that operates primarily in the
Fourier transform space.

Carry-lookahead (sec. II C 5), conditional-sum (sec. III B 2), and
carry-save (sec. II C 4) all reach O(log n) performance for addition.
Carry-lookahead and conditional-sum use more space than carry-ripple,
but much less than carry-save. However, carry-save adders can be
combined into fast multipliers more easily. We will see in sec. III how
to combine carry-lookahead and conditional-sum into the overall
exponentiation algorithms.



1. VBE Carry-Ripple 

The VBE algorithm builds full modular exponentiation from smaller
building blocks. The bulk of the time is spent in $20n^2 - 5n$ ADDERs.
The full circuit requires 7n + 1 qubits of storage: 2n + 1 for a, n for
the other multiplicand, n for a running sum, n for the convolution
products, n for a copy of N, and n for carries.

In this algorithm, the values to be added in, the convolution partial
products of $x^a$, are programmed into a temporary register (combined
with a superposition of $|0\rangle$ as necessary) based on a control
line and a data bit via appropriate CCNOT gates. The latency of ADDER
and the complete algorithm are

$$ t_{ADD} = (4n - 4;\ 4n - 3;\ 0) \qquad (6) $$

$$ t_V = (20n^2 - 5n)\, t_{ADD} = (80n^3 - 100n^2 + 20n;\ 96n^3 - 84n^2 + 15n;\ 8n^2 - 2n + 1) \qquad (7) $$

2. BCDP Carry-Ripple 

The BCDP algorithm is also based on a carry-ripple adder. It differs
from VBE in that it more aggressively takes advantage of classical
computation. However, for our purposes, this makes it harder to use
some of the optimization techniques presented here. Beckman et al.
present several optimizations and tradeoffs of space and time, slightly
complicating the analysis.

The exact sequence of gates to be applied is also dependent on the
input values of N and x, making it less suitable for hardware
implementation with fixed gates (e.g., in an optical system). In the
form we analyze, it requires 5n + 3 qubits, including 2n + 1 for
$|a\rangle$. Borrowing from their equation 6.23,

$$ t_B = (54n^3 - 127n^2 + 108n - 29;\ 10n^3 + 15n^2 - 38n + 14;\ 20n^3 - 38n^2 + 22n - 4) \qquad (8) $$

3. Cuccaro Carry-Ripple

Cuccaro et al. have recently introduced a carry-ripple circuit, which
we will call CUCA, which uses only a single ancilla qubit. The latency
of their adder is (2n - 1; 5; 0) for the AC architecture.

The authors do not present a complete modular exponentiation circuit;
we will use their adder in our algorithms F and G. This adder, we will
see in section IV C 1, is the most efficient known for NTC
architectures.



4. Gossett Carry-Save and Carry-Ripple

Gossett's arithmetic is pure quantum, as opposed to the mixed
classical-quantum of BCDP. Gossett does not provide a full modular
exponentiation circuit, only adders, multipliers, and a modular adder
based on the important classical techniques of carry-save arithmetic
[10].

Gossett's carry-save adder, the important contribution of the paper,
can run in O(log n) time on AC architectures. It will remain
impractical for the foreseeable future due to the large number of
qubits required; Gossett estimates $8n^2$ qubits for a full multiplier,
which would run in $O(\log^2 n)$ time. It bears further analysis
because of its high speed and resemblance to standard fast classical
multipliers.

Unfortunately, the paper's second contribution, Gossett's carry-ripple
adder, as drawn in his figure 7, seems to be incorrect. Once fixed, his
circuit optimizes to be similar to VBE.



5. Carry-Lookahead

Draper, Kutin, Rains, and Svore have recently proposed a
carry-lookahead adder, which we call QCLA. This method allows the
latency of an adder to drop to O(log n) for AC architectures. The
latency and storage of their adder is

$$ t^{AC}_{QCLA} = (4\log_2 n + 3;\ 4;\ 2)\#(n;\ 4n - \log n - 1) \qquad (9) $$

The authors do not present a complete modular exponentiation circuit;
we will use their adder in our algorithm E, which we evaluate only for
AC. The large distances between gate operands make it appear that QCLA
is unattractive for NTC.



6. Beauregard/Draper QFT-based Exponentiation

Beauregard has designed a circuit for doing modular exponentiation in
only 2n + 3 qubits of space [33], based on Draper's clever method for
doing addition on Fourier-transformed representations of numbers [34].

The depth of Beauregard's circuit is $O(n^3)$, the same as VBE and
BCDP. However, we believe the constant factors on this circuit are very
large; every modulo addition consists of four Fourier transforms and
five Fourier additions.

Fowler, Devitt, and Hollenberg have simulated Shor's algorithm using
Beauregard's algorithm, for a class of machine they call linear nearest
neighbor (LNN). LNN corresponds approximately to our NTC. In their
implementation of the algorithm, they found no significant change in
the computational complexity of the algorithm on LNN or an AC-like
abstract architecture, suggesting that the performance of Draper's
adder, like a carry-ripple adder, is essentially
architecture-independent.



III. RESULTS: ALGORITHMIC OPTIMIZATIONS 

We present our concurrent variant of VBE, then move to 
faster adders. This is followed by methods for performing ex- 
ponentiation concurrently, improving the modulo arithmetic, 
and indirection to reduce the number of quantum multiplica- 
tions. 



A. Concurrent VBE 

In figure 2 we show a three-bit concurrent version of the VBE ADDER.
This figure shows that the delay of the concurrent ADDER is
(3n - 3) CCNOT + (2n - 3) CNOT, or

$$ (3n - 3;\ 2n - 3;\ 0) \qquad (10) $$

a mere 25% reduction in latency compared to the unoptimized
(4n - 4; 4n - 3; 0) of equation 6.

FIG. 2: Three-bit concurrent VBE ADDER, AC abstract machine. Gates
marked with an 'x' can be deleted when the carry in is known to be
zero.

Adapting equation 7, the total circuit latency, minus a few small
corrections that fall outside the ADDER block proper, is

$$ t^{AC}_V = (20n^2 - 5n)\, t^{AC}_{ADD} = (60n^3 - 75n^2 + 15n;\ 40n^3 - 70n^2 + 15n;\ 0) \qquad (11) $$

This equation is used to create the first entry in table II.
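
The (CCNOTs; CNOTs; NOTs) bookkeeping lends itself to a small value type. The sketch below is one possible encoding, not part of the original analysis: the class name `Cost` and the helper functions are hypothetical. It multiplies the concurrent ADDER latency of equation 10 by the $20n^2 - 5n$ ADDER calls and checks the product against the closed form of equation 11 for n = 128.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cost:
    """Gate latency in the (CCNOTs; CNOTs; NOTs) notation."""
    ccnot: int
    cnot: int
    nots: int = 0

    def __add__(self, other: "Cost") -> "Cost":
        return Cost(self.ccnot + other.ccnot, self.cnot + other.cnot, self.nots + other.nots)

    def __rmul__(self, calls: int) -> "Cost":
        # Sequential repetition: `calls` back-to-back invocations of the same block.
        return Cost(calls * self.ccnot, calls * self.cnot, calls * self.nots)

def concurrent_vbe_adder(n: int) -> Cost:
    return Cost(3 * n - 3, 2 * n - 3, 0)                    # equation 10

def concurrent_vbe_total(n: int) -> Cost:
    return (20 * n**2 - 5 * n) * concurrent_vbe_adder(n)    # equation 11

n = 128
total = concurrent_vbe_total(n)
closed_form = Cost(60 * n**3 - 75 * n**2 + 15 * n, 40 * n**3 - 70 * n**2 + 15 * n, 0)
print(total == closed_form, total.ccnot)
```

For n = 128 the CCNOT latency comes to 124,602,240 gate times, which gives a sense of the baseline that the later optimizations are measured against.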

B. Carry-Select and Conditional-Sum Adders 

Carry-select adders concurrently calculate possible results without
knowing the value of the carry in. Once the carry in becomes available,
the correct output value is selected using a multiplexer (MUX). The
type of MUX determines whether the behavior is $O(\sqrt{n})$ or
O(log n).



FIG. 3: Three-bit carry-select adder (CSLA) with multiplexer (MUX).
$a_i$ and $b_i$ are addends. The control-SWAP gates in the MUX select
either the qubits marked $c_{in} = 1$ or $c_{in} = 0$, depending on the
state of the carry in qubit $c_{in}$. $s_i$ qubits are the output sum
and $k_i$ are internal carries.

FIG. 4: Block-level diagram of four-group carry-select adder. $a_i$ and
$b_i$ are addends and $s_i$ is the sum. Additional ancillae not shown.



1. $O(\sqrt{n})$ Carry-Select Adder

The bits are divided into g groups of m bits each, n = gm. The adder
block we will call CSLA, and the combined adder, MUXes, and adder undo
to clean our ancillae, CSLAMU. The CSLAs are all executed concurrently,
then the output MUXes are cascaded, as shown in figure 3. The first
group may have a different size, f, than m, since it will be faster,
but for the moment we assume they are the same.

Figure 3 shows a three-bit carry-select adder. This generates two
possible results, assuming that the carry in will be zero or one. The
portion on the right is a MUX used to select which carry to use, based
on the carry in. All of the outputs without labels are ancillae to be
garbage collected. It is possible that a design optimized for space
could reuse some of those qubits; as drawn, a full carry-select circuit
requires 5m - 1 qubits to add two m-bit numbers.

The larger m-bit carry-select adder can be constructed so that its
internal delay, as in a normal carry-ripple adder, is one additional
CCNOT for each bit, although the total number of gates increases and
the distance between gate operands increases.

The latency for the CSLA block is

$$ t^{AC}_{CS} = (m;\ 2;\ 0) \qquad (12) $$

Note that this is not a "clean" adder; we still have ancillae to
return to the initial state.

The problem for implementation will be creating an efficient MUX,
especially on NTC. Figure 4 makes it clear that the total carry-select
adder is only faster if the latency of MUX is substantially less than
the latency of the full carry-ripple. It will be difficult for this to
be more efficient than the single-CCNOT delay of the basic VBE
carry-ripple adder on NTC. On AC, it is certainly easy to see how the
MUX can use a fanout tree consisting of more ancillae and CNOT gates to
distribute the carry in signal, as suggested by Moore [12], allowing
all MUX Fredkin gates to be executed concurrently. A full fanout
requires an extra m qubits in each adder.

In order to unwind the ancillae to reuse them, the simplest approach is
the use of CNOT gates to copy our result to another n-bit register,
then a reversal of the circuitry. Counting the copy out for ancilla
management, we can simplify the MUX to two CCNOTs and a pair of NOTs.

The latency of the carry ripple from MUX to MUX (not qubit to qubit)
can be arranged to give a MUX cost of (4g + 2m - 6; 0; 2g - 2). This
cost can be accelerated somewhat by using a few extra qubits and
"fanning out" the carry. For intermediate values of m, we will use a
fanout of 4 on AC, reducing the MUX latency to
(4g + m/2 - 6; 2; 2g - 2) in exchange for 3 extra qubits in each group.

Our space used for the full, clean adder is
(6m - 1)(g - 1) + 3f + 4g when using a fanout of 4.

The total latency of the CSLA, MUX, and the CSLA undo is

$$ t^{AC}_{SEL} = 2t^{AC}_{CS} + t^{AC}_{MUX} = (4g + 5m/2 - 6;\ 6;\ 2g - 2) \qquad (13) $$

Optimizing for AC, based on equation 13, the delay will be the minimum
when $m \approx \sqrt{8n/5}$.
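
The optimum group size follows from a short calculus argument. The derivation below is a sketch under two assumptions that are not stated in the text: only the CCNOT term of equation 13 is minimized, and m is treated as a continuous variable with g = n/m.

```latex
\begin{align*}
t(m) &= \frac{4n}{m} + \frac{5m}{2} - 6, \\
\frac{dt}{dm} &= -\frac{4n}{m^{2}} + \frac{5}{2} = 0
  \;\Longrightarrow\; m = \sqrt{\frac{8n}{5}}, \\
t_{\min} &= 2\sqrt{10n} - 6 .
\end{align*}
```

For n = 128 this gives m ≈ 14.3 and a minimum of about 65.6 CCNOT delays; the nearest power-of-two split, m = 16 and g = 8, yields 4·8 + 5·16/2 - 6 = 66.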

Zalka was the first to propose use of a carry-select adder, though he
did not refer to it by name [11]. His analysis does not include an
exact circuit, and his results differ slightly from ours.



2. O(log n) Conditional Sum Adder

As described above, the carry-select adder is O(m + g), for n = mg,
which minimizes to be $O(\sqrt{n})$. To reach O(log n) performance, we
must add a multi-level MUX to our carry-select adder. This structure is
called a conditional sum adder, which we will label CSUM. Rather than
repeatedly choosing bits at each level of the MUX, we will create a
multi-level distribution of MUX select signals, then apply them once at
the end. Figure 5 shows only the carry signals for eight CSLA groups.
The e signals in the figure are our effective swap control signals.
They are combined with a carry in signal to control the actual swap of
variables. In a full circuit, a ninth group, the first group, will be a
carry-ripple adder and will create the carry in; that carry in will be
distributed concurrently in a separate tree.

FIG. 5: O(log n) MUX for conditional-sum adder, for g = 9 (the first
group is not shown). Only the $c_{ij}$ carry out lines from each
m-qubit block are shown, where i is the block number and j is the carry
in value. At each stage, the span of correct effective swap control
lines $e_{ij}$ doubles. After using the swap control lines, all but the
last must be cleaned by reversing the circuit. Unlabeled lines are
ancillae to be cleaned.

The total adder latency will be

$$ t^{AC}_{CSUM} = 2t^{AC}_{CS} + (2\lceil\log_2(g-1)\rceil - 1) \times (2;\ 0;\ 2) + (4;\ 0;\ 4) = (2m + 4\lceil\log_2(g-1)\rceil + 2;\ 4;\ 4\lceil\log_2(g-1)\rceil + 2) \qquad (14) $$

where $\lceil x \rceil$ indicates the smallest integer not smaller than
x.

For large n, this generally reaches a minimum for small m, which gives
asymptotic behavior $\sim 4\log_2 n$, the same as QCLA. CSUM is
noticeably faster for small n, but requires more space.
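
The comparison with QCLA can be tabulated directly from the CCNOT terms of equations 14 and 9. The sketch below is illustrative: the function names are hypothetical, all groups (including the first) are assumed to have the same size m with g = n/m, and only a few power-of-two group sizes are tried.

```python
import math

def csum_ccnot_latency(n: int, m: int) -> int:
    """CCNOT term of equation 14 for g = n/m groups of m bits (requires g >= 2)."""
    g = n // m
    return 2 * m + 4 * math.ceil(math.log2(g - 1)) + 2

def qcla_ccnot_latency(n: int) -> int:
    """CCNOT term of equation 9; n is assumed to be a power of two."""
    return 4 * int(math.log2(n)) + 3

for n in (16, 128):
    options = {m: csum_ccnot_latency(n, m) for m in (1, 2, 4, 8)}
    best_m = min(options, key=options.get)
    print(n, best_m, options[best_m], qcla_ccnot_latency(n))
```

Under these assumptions the CCNOT latencies are 18 versus 19 for n = 16 and 30 versus 31 for n = 128, so the two adders track each other closely in this term; the space cost is not modeled here.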

The MUX uses $\lceil 3(g-1)/2 \rceil - 2$ qubits in addition to the
internal carries and the tree for dispersing the carry in. Our space
used for the full, clean adder is
$(6m - 1)(g - 1) + 3f + \lceil 3(g-1)/2 - 2 + (n-f)/2 \rceil$.

FIG. 6: Concurrent modular multiplication in modular exponentiation for
s = 2. QSET simply sets the sum register to the appropriate value.



C. Concurrent Exponentiation 

Modular exponentiation is often drawn as a string of modular
multiplications, but Cleve and Watrous pointed out that these can
easily be parallelized, at linear cost in space [9]. We always have to
execute 2n multiplications; the goal is to do them in as few
time-delays as possible.

To go (almost) twice as fast, use two multipliers. For four times, use
four. Naturally, this can be built up to n multipliers to multiply the
necessary 2n + 1 numbers, in which case a tree recombining the partial
results requires $\log_2 n$ quantum-quantum (Q-Q) multiplier latency
times. The first unit in each chain just sets the register to the
appropriate value if the control line is 1; otherwise, it leaves it
as 1.

For s multipliers, s < n, each multiplier must combine
$r = \lfloor (2n+1)/s \rfloor$ or r + 1 numbers, using $r - 1$ or r
multiplications (the first number being simply set into the running
product register), where $\lfloor x \rfloor$ indicates the largest
integer not larger than x. The intermediate results from the
multipliers are combined using $\lceil \log_2 s \rceil$ Q-Q
multiplication steps.
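
The division of work among s multipliers can be computed directly from the quantities just defined. The sketch below is an illustration only: the function name `split_work` and the dictionary keys are hypothetical, and it reports the chain lengths and combining steps rather than an overall latency.

```python
import math

def split_work(n: int, s: int) -> dict:
    """Distribute the 2n + 1 numbers to be multiplied over s concurrent multipliers."""
    numbers = 2 * n + 1
    r = numbers // s                     # r = floor((2n + 1) / s)
    longer_chains = numbers - r * s      # multipliers that receive r + 1 numbers
    return {
        "r": r,
        "chains_with_r_plus_1": longer_chains,
        "chains_with_r": s - longer_chains,
        "longest_chain_multiplications": r if longer_chains else r - 1,
        "qq_combining_steps": math.ceil(math.log2(s)),
    }

print(split_work(128, 4))
```

For n = 128 and s = 4, three multipliers combine 64 numbers each and one combines 65, followed by two levels of Q-Q multiplication to merge the four partial products.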

For a parallel version of VBE, the exact latency, including cases where
$rs \neq 2n + 1$, is

$$ R_V = 2r + 1 + \lceil \log_2( \lceil (s - 2n - 1 + rs)/4 \rceil + 2n + 1 - rs ) \rceil \qquad (15) $$

times the latency of our multiplier. For small s, this is O(n); for
larger s,

$$ \lim_{s \to n} O(n/s + \log s) = O(\log n) \qquad (16) $$



D. Reducing the Cost of Modulo Operations

The VBE algorithm does a trial subtraction of N in each modulo addition
block; if that underflows, N is added back in to the total. This
accounts for two of the five ADDER blocks and much of the extra logic
to compose a modulo adder. The last two of the five blocks are required
to undo the overflow bit.

Figure 7 shows a more efficient modulo adder than VBE, based partly on
ideas from BCDP and Gossett. It requires only three adder blocks,
compared to five for VBE, to do one modulo addition. The first adder
adds $x_j$ to our running sum. The second conditionally adds
$2^n - x_j - N$ or $2^n - x_j$, depending on the value of the overflow
bit, without affecting the overflow bit, arranging it so that the third
addition of $x_j$ will overflow and clear the overflow bit if
necessary. The blocks pointed to by arrows are the addend register,
whose value is set depending on the control lines. Figure 7 uses n
fewer bits than VBE's modulo arithmetic, as it does not require a
register to hold N.

FIG. 7: More efficient modulo adder. The blocks with arrows set the
register contents based on the value of the control line. The position
of the black block indicates the running sum in our output.

In a slightly different fashion, we can improve the performance of VBE
by adding a number of qubits, p, to our result register, and postponing
the modulo operation until later. This works as long as we don't allow
the result register to overflow; we have a redundant representation of
modulo N values, but that is not a problem at this stage of the
computation.

The largest number that doesn't overflow for p extra qubits is
$2^{n+p} - 1$; the largest number that doesn't result in subtraction is
$2^{n+p-1} - 1$. We want to guarantee that we always clear that
high-order bit, so if we subtract bN, the most iterations we can go
before the next subtraction is b.

The largest multiple of N we can subtract is
$\lfloor 2^{n+p-1}/N \rfloor$. Since $2^{n-1} < N < 2^n$, the largest b
we can allow is, in general, $2^{p-1}$.

For example, adding three qubits, p = 3, allows b = 4, reducing the 20
ADDER calls VBE uses for four additions to 9 ADDER calls, a 55%
performance improvement. As p grows larger, the cost of the adjustment
at the end of the calculation also grows and the additional gains are
small. We must use 3p adder calls at the end of the calculation to
perform our final modulo operation. Calculations suggest that p of up
to 10 or 11 is still faster.

The equation below shows the number of calls to our adder block
necessary to make an n-bit modulo multiplier.

$$ R_M = n(2b+1)/b \qquad (17) $$



E. Indirection 

We have shown elsewhere that it is possible to build a table 
containing small powers of x, from which an argument to a 
multiplier is selected |2^ . In exchange for adding storage 
space for $2^w$ n-bit entries in a table, we can reduce the number
of multiplications necessary by a factor of w. This appears to
be attractive for small values of w, such as 2 or 3.

FIG. 8: Implicit Indirection. The arrows pointing to blocks indicate
the setting of the addend register based on the control lines. This
sets the addend from a table stored in classical memory, reducing the
number of quantum multiplications by a factor of w in exchange for
$2^w$ argument setting operations.

[FIG. 9 circuit panels: w = 2, w = 3, w = 4; inputs $|a_j\rangle$, $|tmp = 0\rangle$, $|enable\rangle$]

FIG. 9: Argument setting for indirection for different values of w,
for the AC architecture. For the w = 4 case, the two CCNOTs on
the left can be executed concurrently, as can the two on the right, for
a total latency of 3.

In our prior work, we proposed using a large quantum memory,
or a quantum-addressable classical memory (QACM) [37]. Here we
show that the quantum storage space need not grow; we can
implicitly perform the lookup by choosing which gates to apply
while setting the argument. In figure 8 we show the setting and
resetting of the argument for w = 2, where the arrows indicate
CCNOTs to set the appropriate bits of the register to 1. The actual
implementation can use a calculated enable bit to reduce the
CCNOTs to CNOTs. Only one of the values $x^0$, $x^1$, $x^2$, or
$x^3$ will be enabled, based on the value of $|a_1 a_0\rangle$.
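
A classical analogue shows why a table of $2^w$ candidate arguments cuts the number of multiplications by a factor of w. The sketch below is not the quantum circuit; the function name `windowed_modexp` and its parameters are hypothetical, and the table is rebuilt classically for every window simply to keep the example short.

```python
def windowed_modexp(x, a, N, w, nbits):
    """Compute x**a mod N, consuming w exponent bits per multiplication."""
    result = 1
    multiplications = 0
    for j in range(0, nbits, w):
        window = (a >> j) & ((1 << w) - 1)      # value of the w control bits
        base = pow(x, 1 << j, N)                # x^(2^j) mod N, computed classically
        # The 2**w possible arguments; only table[window] is "enabled".
        table = [pow(base, k, N) for k in range(1 << w)]
        result = (result * table[window]) % N   # one multiplier call
        multiplications += 1
    return result, multiplications


print(windowed_modexp(7, 733, 143, w=1, nbits=10))
print(windowed_modexp(7, 733, 143, w=2, nbits=10))
```

With w = 2 the same result is reached with half as many multiplier calls, at the price of choosing among four arguments before each call.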

The setting of this input register may require propagating
$|a\rangle$ or the enable bit across the entire register. Use of a few
extra qubits ($2^{w-1}$) will allow the several setting operations
to propagate in a tree.



$$t_{ARG}^{AC} = \begin{cases} (4; 0; 4) & w = 2 \\ (3 \times 2^w; 0; 2^w) & w = 3, 4 \end{cases} \tag{18}$$



For w = 2 and w = 3, we calculate that setting the argument
adds (4; 0; 4)#(4, 5) and (24; 0; 8)#(8, 9), respectively,
to the latency, concurrency and storage of each adder. We
create separate enable signals for each of the $2^w$ possible
arguments and pipeline flowing them across the register to set
the addend bits. We consider this cost only when using
indirection. Figure 9 shows circuits for w = 2, 3, 4.

Adapting the earlier latency equation to both indirection and
concurrent multiplication, we have a total latency for our circuit,
in multiplier calls, of

$$R_I = 2r + 1 + \lceil \log_2(\lceil (s - 2n - 1 + rs)/4 \rceil + 2n + 1 - rs) \rceil \tag{19}$$

where $r = \lfloor \lceil (2n+1)/w \rceil / s \rfloor$.

IV. EXAMPLE: EXPONENTIATING A 128-BIT NUMBER 

In this section, we combine these techniques into complete
algorithms and examine the performance of modular exponentiation
of a 128-bit number. We assume the primary engineering
constraint is the available number of qubits. In section III C
we showed that using twice as much space can almost double our
speed, essentially linearly until the log term begins to kick in.
Thus, in managing space tradeoffs, this will be our standard; any
technique that raises performance by more than a factor of c in
exchange for c times as much space will be used preferentially to
parallel multiplication. Carry-select adders (sec. III B) easily
meet this criterion, being perhaps six times faster for less than
twice the space.
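
The shape of this tradeoff can be seen by evaluating the n/s + log s term of equations 16 and 25 directly. The sketch below is an idealised model only: the name `depth_factor` is hypothetical, the logarithm is taken base 2, and all constants and rounding are ignored.

```python
import math


def depth_factor(n, s):
    """Idealised n/s + log2(s) term for n-bit operands and s multipliers."""
    return n / s + math.log2(s)


n = 128
for s in (1, 2, 4, 8, 16, 32, 64, 128):
    print(s, depth_factor(n, s))
```

Doubling s from 1 to 2 takes the factor from 128 to 65, almost a halving, while doubling from 64 to 128 leaves it at 8, which is where the log term dominates.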

Algorithm D uses 100n space and our conditional-sum
adder CSUM. Algorithm E uses 100n space and the carry-lookahead
adder QCLA. Algorithms F and G use the Cuccaro adder and 100n
and minimal space, respectively. Parameters for these algorithms
are shown in table I. We have included detailed equations for
concurrent VBE and D and numeric results in table II. The
performance ratios are based only on the CCNOT gate count for AC,
and only on the CNOT gate count for NTC.



A. Concurrent VBE 

On AC, the concurrent VBE ADDER is (3n - 3; 2n - 3; 0) =
(381; 253; 0) for 128 bits. This is the value we use in
the concurrent VBE line in table II. This will serve as our best
baseline time for comparing the effectiveness of more drastic
algorithmic surgery.

Figure 10 shows a fully optimized, concurrent, but otherwise
unmodified version of the VBE ADDER for three bits on
a neighbor-only machine (NTC architecture), with the gates
marked 'x' in figure 2 eliminated. The latency is



$$t_{ADD}^{NTC} = (20n - 15; 0)\#(2; 3n + 1) \tag{20}$$



or 45 gate times for the three-bit adder. A 128-bit adder will
have a latency of (2545; 0). The diagram shows a concurrency
level of three, but simple adjustment of execution time slots
can limit that to two for any n, with no latency penalty.

The unmodified full VBE modular exponentiation algorithm
consists of $20n^2 - 5n = 327040$ ADDER calls plus
minor additional logic.
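
The baseline figures for concurrent VBE can be reproduced by multiplying the per-adder latency by the number of ADDER calls. This is a minimal sketch with hypothetical function names; it scales each latency tuple component-wise and ignores the minor additional logic, which is an assumption of the example.

```python
def vbe_adder_calls(n):
    """ADDER calls in the unmodified VBE modular exponentiation."""
    return 20 * n**2 - 5 * n


def vbe_adder_latency_ac(n):
    """Concurrent VBE ADDER latency tuple on the AC architecture."""
    return (3 * n - 3, 2 * n - 3, 0)


def vbe_adder_latency_ntc(n):
    """Concurrent VBE ADDER latency tuple on the NTC architecture."""
    return (20 * n - 15, 0)


def scale(latency, calls):
    """Multiply every component of a latency tuple by a call count."""
    return tuple(calls * t for t in latency)


n = 128
calls = vbe_adder_calls(n)
total_ac = scale(vbe_adder_latency_ac(n), calls)
total_ntc = scale(vbe_adder_latency_ntc(n), calls)

# The NTC total agrees with the closed form of equation 21.
assert total_ntc[0] == 400 * n**3 - 400 * n**2 + 75 * n

print(calls)      # 327040
print(total_ac)   # (124602240, 82741120, 0)
print(total_ntc)  # (832316800, 0)
```

Rounded to three significant figures, these totals are the concurrent VBE entries of table II.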



$$t_V^{NTC} = (20n^2 - 5n)\, t_{ADD}^{NTC} = (400n^3 - 400n^2 + 75n; 0) \tag{21}$$

algorithm       | adder        | modulo           | indirect | multipliers (s) | space | concurrency
----------------|--------------|------------------|----------|-----------------|-------|----------------
concurrent VBE  | VBE          | VBE              | N/A      | 1               | 897   | 2
algorithm D     | CSUM (m = 4) | p = 11, b = 1024 | w = 2    | 12              | 11969 | 126 × 12 = 1512
algorithm E     | QCLA         | p = 10, b = 512  | w = 2    | 16              | 12657 | 128 × 16 = 2048
algorithm F     | CUCA         | p = 10, b = 512  | w = 4    | 20              | 11077 | 20 × 2 = 40
algorithm G     | CUCA         | fig. 7           | w = 4    | 1               | 660   | 2

TABLE I: Parameters for our algorithms, chosen for 128 bits.



algorithm       | AC gates                                | AC perf. | NTC gates                  | NTC perf.
----------------|-----------------------------------------|----------|----------------------------|----------
concurrent VBE  | (1.25 × 10^8; 8.27 × 10^7; 0.00 × 10^0) | 1.0      | (8.32 × 10^8; 0.00 × 10^0) | 1.0
algorithm D     | (2.19 × 10^5; 2.57 × 10^?; 1.67 × 10^?) | 569.8    | N/A                        | N/A
algorithm E     | (1.71 × 10^5; 1.96 × 10^?; 2.93 × 10^?) | 727.2    | N/A                        | N/A
algorithm F     | (7.84 × 10^5; 1.30 × 10^?; 4.10 × 10^?) | 158.9    | (4.11 × 10^6; 4.10 × 10^?) | 202.5
algorithm G     | (1.50 × 10^7; 2.48 × 10^?; 7.93 × 10^?) | 8.3      | (7.87 × 10^7; 7.93 × 10^?) | 10.6

TABLE II: Latency to factor a 128-bit number for various architectures and choices of algorithm. AC, abstract concurrent architecture; NTC,
neighbor-only, two-qubit gate, concurrent architecture; perf., performance relative to the VBE algorithm for that architecture, based on CCNOTs
for AC and CNOTs for NTC.



B. Algorithm D

The overall structure of algorithm D is similar to VBE, with
our conditional-sum adders instead of the VBE carry-ripple,
and our improvements in indirection and modulo. As we do
not consider CSUM to be a good candidate for an algorithm
for NTC, we evaluate only for AC. Algorithm D is the fastest
algorithm for n = 8 and n = 16.

$$t_D^{AC} = R_I R_M (t_{CSUM} + t_{ARG}) + 3p\, t_{CSUM} \tag{22}$$

Letting $r = \lfloor \lceil (2n+1)/w \rceil / s \rfloor$, the latency
and space requirements for algorithm D are

$$\begin{aligned}
t_D^{AC} ={}& (2r + 1 + \lceil \log_2(\lceil (s - 2n - 1 + rs)/4 \rceil + 2n + 1 - rs) \rceil)\, n(2b+1)/b \\
& \times ((2m + 4\lceil \log_2(g-1) \rceil + 2;\ 4;\ 4\lceil \log_2(g-1) \rceil + 2) + (4; 0; 4)) \\
& + 3p\,(2m + 4\lceil \log_2(g-1) \rceil + 2;\ 4;\ 4\lceil \log_2(g-1) \rceil + 2)
\end{aligned} \tag{23}$$

and

$$\begin{aligned}
S_D &= s(S_{CSUM} + 2^w + 1 + p + n) + 2n + 1 \\
&= s(7n - 3m - g + 2^w + p + \lceil 3(g-1)/2 - 2 + (n-m)/2 \rceil) + 2n + 1
\end{aligned} \tag{24}$$


C. Algorithm E

Algorithm E uses the carry-lookahead adder QCLA in
place of the conditional-sum adder CSUM. Although CSUM
is slightly faster than QCLA, its significantly larger space
consumption means that in our 100n fixed-space analysis, we can
fit in 16 multipliers using QCLA, compared to only 12 using
CSUM. This allows the overall algorithm E to be 28% faster
than D for 128 bits.


1. Algorithms F and G

The Cuccaro carry-ripple adder has a latency of (10n + 5; 0)
for NTC. This is twice as fast as the VBE adder. We
use this in our algorithms F and G. Algorithm F uses 100n
space, while G is our attempt to produce the fastest algorithm
in the minimum space.



D. Smaller n and Different Space 

Figure 11 shows the execution times of our three fastest
algorithms for n from eight to 128 bits. Algorithm D, using
CSUM, is the fastest for eight and 16 bits, while E, using
QCLA, is fastest for larger values. The latency of 1072 for
n = 8 bits is 32 times faster than concurrent VBE, achieved
with 60n = 480 qubits of space.

Figure 12 shows the execution times for n = 128 bits for
various amounts of available space. All of our algorithms have
reached a minimum by 240n space (roughly $1.9n^2$).



FIG. 11: Execution time for our algorithms for space 100n on the
AC architecture, for varying value of n.

FIG. 12: Execution time for our algorithms for 128 bits on the AC
architecture, for varying multiples of n space available.



E. Asymptotic Behavior

The focus of this paper is the constant factors in modular
exponentiation for important problem sizes and architectural
characteristics. However, let us look briefly at the asymptotic
behavior of our circuit depth.

In section III C we showed that the latency of our complete
algorithm is

$$O(n/s + \log s) \times \text{latency of multiplication} \tag{25}$$

as we parallelize the multiplication using s multiplier blocks.



Our multiplication algorithm is still

$$O(n) \times \text{latency of addition} \tag{26}$$

Algorithms D and E both use an O(log n) adder. Combining
equations 25 and 26 with the adder cost, we have asymptotic
circuit depth of

$$t_D^{AC} = t_E^{AC} = O((n \log n)(n/s + \log s)) \tag{27}$$

for algorithms D and E. As $s \to n$, these approach
$O(n \log^2 n)$ and space consumed approaches $O(n^2)$.

Algorithm F uses an O(n) adder, whose asymptotic behavior
is the same on both AC and NTC, giving

$$t_F^{AC} = t_F^{NTC} = O((n^2)(n/s + \log s)) \tag{28}$$

approaching $O(n^2 \log n)$ as space consumed approaches
$O(n^2)$.

This compares to asymptotic behavior of $O(n^3)$ for VBE,
BCDP, and algorithm G, using O(n) space. The limit of
performance, using a carry-save multiplier and large s, will be
$O(\log^3 n)$ in $O(n^3)$ space.
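
As a worked check of these limits, the two extremes of s can be substituted into the depth expressions of equations 27 and 28. This is a sketch that ignores constant factors; the first two lines take s = n, and the last takes a single multiplier with an O(n) adder, as in algorithm G.

```latex
\begin{align*}
s = n:\quad & (n \log n)\left(\tfrac{n}{n} + \log n\right)
              = (n \log n)(1 + \log n) = O(n \log^2 n) \\
s = n:\quad & n^2 \left(\tfrac{n}{n} + \log n\right)
              = n^2 (1 + \log n) = O(n^2 \log n) \\
s = 1:\quad & n^2 \left(\tfrac{n}{1} + \log 1\right)
              = n^3 = O(n^3)
\end{align*}
```

With s = n multiplier blocks of O(n) qubits each, the space grows as $O(n^2)$, which matches the space bounds quoted for the first two cases.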

V. DISCUSSION AND FUTURE WORK 

We have shown that it is possible to significantly accelerate
quantum modular exponentiation using a stable of techniques.
We have provided exact gate counts, rather than
asymptotic behavior, for the n = 128 case, showing algorithms
that are faster by a factor of 200 to 700, depending on
architectural features, when 100n qubits of storage are available.
For n = 1024, this advantage grows to more than a
factor of 5,000 for non-neighbor machines (AC). Neighbor-only
(NTC) machines can run algorithms such as addition
in O(n) time at best, whereas non-neighbor machines (AC) can
achieve O(log n) performance.

In this work, our contribution has focused on parallelizing
execution of the arithmetic through improved adders, concurrent
gate execution, and overall algorithmic structure. We
have also made improvements that resulted in the reduction
of modulo operations, and traded some classical for quantum
computation to reduce the number of quantum operations. It
seems likely that further improvements can be found in the
overall structure and by more closely examining the construction
of multipliers from adders Isoll . We also intend to pursue
multipliers built from hybrid carry-save adders.

The three factors which most heavily influence performance
of modular exponentiation are, in order, concurrency,
the availability of large numbers of application-level qubits,
and the topology of the interconnection between qubits. Without
concurrency, it is of course impossible to parallelize the
execution of any algorithm. Our algorithms can use up to
~$2n^2$ application-level qubits to execute the multiplications
in parallel, executing O(n) multiplications in O(log n) time
steps. Finally, if any two qubits can be operands to a quantum
gate, regardless of location, the propagation of information
about the carry allows an addition to be completed in O(log n)
time steps instead of O(n). We expect that these three factors
will influence the performance of other algorithms in similar
fashion.

Not all physically realizable architectures map cleanly to
one of our models. A full two-dimensional mesh, such as neutral
atoms in an optical lattice [38], and a loose trellis topology
[39] probably fall between AC and NTC. The behavior
of the scalable ion trap [40] is not immediately clear. We have
begun work on expanding our model definitions, as well as
additional ways to characterize quantum computer architectures.

The process of designing a large-scale quantum computer
has only just begun. Over the coming years, we expect advances
in the fundamental technology, the system architecture,
algorithms, and tools such as compilers to all contribute
to the creation of viable quantum computing machines. Our
hope is that the algorithms and techniques in this paper will
contribute to that engineering process in both the short and
long term.